Extracting Route Directions from Web Pages

Authors

  • Xiao Zhang
  • Prasenjit Mitra
  • Sen Xu
  • Anuj R. Jaiswal
  • Alexander Klippel
  • Alan M. MacEachren
Abstract

Linguists and geographers are increasingly interested in route direction documents because they contain interesting motion descriptions and language patterns. Large numbers of such documents are readily available on the Internet. A challenging task is to automatically extract the meaningful route parts (destinations, origins and instructions) from route direction documents. However, no existing work addresses this problem. In this paper, we present our effort toward this goal. Based on our observation that sentences are the basic units of route parts, we extract sentences from HTML documents using both natural language knowledge and HTML tag information. We also study the sentence classification problem in route direction documents and its sequential nature. We compare and analyze several machine learning methods and study the impact of different feature sets. Based on the insights obtained, we propose using sequence labelling models such as conditional random fields (CRFs) and maximum entropy Markov models (MEMMs), which yield high accuracy in route part extraction. The approach is evaluated on over 10,000 hand-tagged sentences in 100 documents, and the experimental results show the effectiveness of our method. These techniques have been implemented and published as the first module of the GeoCAM system, which is also briefly introduced in this paper.
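
The abstract does not spell out the sentence extraction rules, so the following is only a minimal sketch of the general idea of combining HTML tag information with natural language cues. It assumes that block-level tags always end a sentence and that a simple punctuation rule is enough inside a block; the tag set `BLOCK_TAGS`, the class `BlockTextParser` and the function `extract_sentences` are hypothetical names, not part of the authors' system.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags end a text block, so no sentence spans across them.
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "td",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextParser(HTMLParser):
    """Collects text and starts a new block at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, normalize whitespace, keep non-empty blocks.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_sentences(html):
    """Split HTML into sentences using tag boundaries, then punctuation."""
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    sentences = []
    for block in parser.blocks:
        # Naive rule: split after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", block):
            if sentence:
                sentences.append(sentence)
    return sentences


page = ("<h2>Directions to Campus</h2>"
        "<ul><li>Take exit 5. Turn left onto Main St</li>"
        "<li>Park in Lot B</li></ul>")
for sentence in extract_sentences(page):
    print(sentence)
```

The tag boundaries matter because list items and headings in route direction pages often lack closing punctuation, so a purely punctuation-based splitter would merge them. This sketch does not skip `script` or `style` content and does not handle abbreviations; the extracted sentences would then be passed to a sequence labelling model such as a CRF, which is not shown here.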


Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...


Dynamic Vision-Based Approach in Web Data Extraction

Extracting data records from the response pages returned by web databases or search engines is a challenging problem posed by the World Wide Web. Traditional web crawlers focus only on the surface web while the deep web keeps expanding behind the scenes. Deep web pages are created dynamically as a result of queries posed to specific web databases. Extracting...


Extracting Attributes and Their Values from Web Pages

We propose a method for extracting attributes and their values from Web pages. Our method makes use of word distributions estimated from plain Web pages. The key idea is to estimate word distribution by consulting ontologies built from HTML tables. In a series of experiments, we show that estimated word distributions are useful for extracting attributes and their values in various kinds of HTML...


Optimized Content Extraction from web pages using Composite Approaches

The information available on the web today is tremendous and comes with great challenges. Content extraction identifies the main content and removes the clutter from web pages. The main problems in extracting content from a web page are the newer architecture of web pages and the diversity of their structure. Optimized content extraction from HTML documents using collective approac...


Evaluating the Quality of the Web Pages of Research Institutes Affiliated with the Ministry of Science, Research and Technology Located in Tehran from the Users' Perspective

Especially in research centers, evaluating the quality of web pages from the clients' point of view plays a constructive role in their design and development, since it familiarizes web developers with the clients' perspective and helps them design client-oriented web sites for scientific and research environments. As a model for assessing the quality of web pages, "webQual" attempts to provid...



Publication date: 2009